THE VIRTUAL MEMORY IN THE STRETCH COMPUTER



by John Cocke 

and 
Harwood G. Kolsky



I. INTRODUCTION 

Early in the planning of the Stretch computer it was seen that by
using the latest solid state components in sophisticated circuits it
would be possible to increase the speed of floating point arithmetic
by almost two orders of magnitude over that in existing computers.
However, there seemed to be no possibility of developing on the same
time scale economically feasible large memories with more than a
factor of ten or perhaps twenty increase in speed. As a result, the
proposed system appeared to be in danger of being seriously
memory-access limited.

Moreover, as the speed of the floating point operations increases, a
larger and larger percentage of the computer's time is spent on
"parasitic operations", i.e., operations whose only function is
program control and data selection. It was obvious that a radically
new machine organization was necessary in order to capitalize upon the
possibilities opened up by the high arithmetic speeds in the presence
of relatively slow memories.

At this time, a number of persons were considering the possibility of
a "look-ahead" device in which an independent indexing arithmetic unit
would prepare the effective addresses of instructions and initiate
memory references to a multiplicity of memory boxes. The data thus
fetched would be held in high-speed buffer registers until needed by
the arithmetic unit. This device would serve two desirable purposes:
(1) some of the parasitic operations would be done in parallel and thus
not delay the principal calculations, and (2) several memory boxes
could be running simultaneously, giving the effect of higher memory
speed.

Since our original work on the virtual memory and simulation in 1957-58,
a large number of detailed changes have been made in the actual
hardware design of Stretch. These necessitated several modifications in
the simulation program to estimate their effect on the overall system
performance. In this report we are omitting many of these changes
for expository reasons, since our purpose is to describe the virtual
memory and timing simulation concepts, not to describe the Stretch
hardware exactly. The result is that the system described below
embodies a more general system than that found in the simulator, which
in turn is more general than that found in the actual computer.



II. GENERAL DESCRIPTION OF THE SYSTEM 

The major logically-independent blocks of the Stretch computer are
shown in Figure 1. Each of the units pictured may be considered as
operating asynchronously. That is, each does its tasks as fast as
possible independently of the others. In theory, each box could have
its own clocking circuits and still operate properly. In practice, for
economy's sake they are all timed by the same master oscillator, but
this does not destroy their logical independence.

The bus control unit serves as a routing agent between the memories
and the various data processing units. If two or more units make a
request simultaneously, the control unit assigns priorities in the
following order: (1) High-speed Exchange, (2) Basic Exchange,
(3) Virtual Memory, and (4) Indexing Arithmetic Unit.

The Indexing Arithmetic Unit fetches instructions, performs all necessary 
indexing operations and sends the instructions to be executed to the 
Virtual Memory. 

The Virtual Memory fetches and receives the data required by the
instruction and holds this data until the arithmetic unit is ready for
it. The virtual memory also performs all store operations. It holds the
data generated by the arithmetic unit or indexing arithmetic unit until
the memory to which the data must be sent is available. Thus the
virtual memory acts not only as a "look-ahead" for instructions to be
fed to the arithmetic unit, but also as a "look-behind" storage buffer.

The actual design of such a "look-ahead" device posed a number of logical
problems, particularly in connection with conditional branches.

However, a machine organization of this complexity requires a
detailed timing analysis in order to determine the value of adding
hardware in the form of the virtual memory. This is especially true
since the sole function of the virtual memory is to increase machine
speed by increasing the efficiency of other devices. It was also felt
that the timing analysis could not be made on the basis of a few
trivial examples (e.g., matrix multiply). Machine performance obtained
in this fashion can be extremely deceptive. Since a detailed timing
analysis of a computer of this complexity is extremely tedious to carry
out by hand, it became clear that if the job were to be done, it would
be necessary to simulate the proposed machine on another computer. This
prompted us to write the simulation program to be described later.

With the above general organization in mind, let us discuss some of
the logical problems posed by such a system. The first problem is a
result of the very concept which enables us to obtain such great
benefits from the stored program computer - the ability to treat
instructions as data. In a system such as we have proposed there is a
large amount of simultaneous operation. For example, the indexing
arithmetic unit may be busy preparing an instruction before previous
instructions have been completed or even started by the arithmetic unit.
One of these previous instructions may modify the instruction which is
presently being indexed. The virtual memory must recognize this
situation and allow the intervening instructions to be completed before
doing the modified instruction.

A similar problem exists with respect to ordinary data. In order to
operate several memories simultaneously, it is necessary to start
obtaining data from these memories before the preceding operations
have been completed. Yet, one of these operations may be a store
into one of the data locations. The virtual memory must make provisions
to ensure that each instruction obtains the most up-to-date data as
implied by the order of the program.

One of the novel features of the Stretch computer is its elaborate
interrupt system. Under this system, whenever some unexpected
occurrence arises, the program will be interrupted and control will
pass to a special routine which is designed to take care of the case in
question, then return control to the original program. In this
situation the virtual memory must have provisions to retain enough
information so that when an interrupt occurs we can resume the
computation exactly where we left off. It must be able to recognize
which of the changes that have been made in advance are not desired and
should be obliterated, and which are exact solutions that must be
restored.

Another special case arises when a conditional branch on arithmetic
results occurs. Here we will not know which of the two branches we
should have taken until the preceding instruction is executed. In the
case where the wrong path has been selected, the virtual memory must be
prepared to drop the intermediate results which have been computed
and pick up the correct branch in a way very similar to that of an
interrupt.



Summing up all these logical problems, we may state that the
fundamental rule for the virtual memory is that it must make the
asynchronous and non-sequential computer give results identical to
those which would be obtained by performing the program one instruction
at a time in the order in which the instructions are written.



III. DETAILED DESCRIPTION OF VIRTUAL MEMORY OPERATION

A. General Conditions to be Considered 

The conditions which occur in the following situations must be consider- 
ed in some detail: 

1. The fetching of instructions by the Indexing Arithmetic Unit
(IAU).

2. The indexing of instructions and modification of index
registers.

3. The loading of the virtual memory and the setting of its
conditions by the IAU.

4. The action of the virtual memory in fetching data.

5. The action of the virtual memory in storing data.

6. The communication between the virtual memory and the 
main arithmetic unit. 

7. Special situations such as conditional branching on arith- 
metic results, etc. 

B. Definitions 

Some of the terms we will use are defined as follows: 

1. Operations 

Operations are considered to be of three types: 

(a) Bring or Fetch Type - All instructions requiring data to be
transmitted from external memory to the virtual memory.

(b) Store Type - Instructions requiring the transmission of
data from the virtual memory to external memory or index
memory.

(Note: We consider all indexing instructions to be of the
store type, although the store may be to either
external memory or index memory.)

(c) Immediate Type - All operations not requiring data
transmission.

2. Virtual Memory Quantities 

(a) Virtual Memory - A number of virtual memory (or look-ahead)
levels (numbered 0 to N-1).

(b) Level of Virtual Memory - A collection of registers and
control bits. The contents of the j th level are shown in
Figure 2.

(c) Instruction Address Register (IAj) - Contains the address of
the instruction currently in the j th level.

(d) Operation Code Register (OPj) - Contains the operation to
be performed by the arithmetic unit.

(e) Store Bit (Sj) - A one-bit trigger which indicates that the
level contains a store type instruction.

(f) Bring Bit (Bj) - A one-bit trigger which indicates that the
level contains a fetch type instruction for which the data
access has not been started.

(g) Forwarding Bit (Fj) - A one-bit trigger which indicates
that the j th level must transmit data to another level.

(h) Forwarding Address (FAj) - A register which contains the
number of the level to which the data must be sent if Fj is
set.

(i) O.K. Bit (OKj) - A trigger which when set indicates that
the correct data for the instruction to be executed is
present in the j th data field.

(j) Data Field (Dj) - A register which contains the operand
data for the instruction.

FIGURE 2. Virtual memory -- contents of one level.

FIGURE 3. Virtual memory interlocks.



(k) Data Address (DAj) - The operand data address (already
indexed by the IAU) for Dj.

(l) Compare Bit (Cj) - A trigger which if not set indicates the
address in DAj should not be included in any address
comparisons being made.

3. Counters 

The virtual memory is controlled by a set of counters which count 
mod(N), where N is the number of virtual memory levels. 

(a) Counter one (C1) - Indicates the level into which the next
instruction may be placed.

(b) Counter two (C2) - Indicates the level from which the next
bring type instruction may be initiated.

(c) Counter three (C3) - Indicates the level from which the next
store type instruction may be initiated.

(d) Counter four (C4) - Indicates the level from which the
arithmetic unit will get its next operation and data.

4. Interlocks 

The above counters must be interlocked in the following manner to
assure proper sequential operation of the computer (see Figure 3):

(a) Interlock one (I1): C1 = C3 + N. Prevents the IAU from
placing the next operation into the level indicated by C1
because an unexecuted store is still in the level.

(b) Interlock two (I2): C1 = C3. Prevents a store from being
initiated from the level indicated by C3 because the store
has already been done.

(c) Interlock three (I3): C1 = C2. Similar to I2; prevents a fetch
from being initiated.

(d) Interlock four (I4): C1 = C4. Prevents the arithmetic unit
from executing an old instruction.

(e) Interlock five (I5): C1 = C4 + N. Prevents the IAU from
placing the next instruction into the level indicated by C1
because the instruction there has not been executed yet.
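
The counter and interlock rules above can be sketched in a few lines of
Python. This is an illustrative model only, not the Stretch hardware
logic. It assumes the counters are kept as running counts (so that a
lead of N can be told apart from a lead of zero) and that the level
number is the count taken mod N. The class name Counters and its method
names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Running counts for the four virtual memory counters (assumed model)."""
    n: int        # number of virtual memory levels, N
    c1: int = 0   # next level to be loaded by the IAU
    c2: int = 0   # next level from which a fetch may be initiated
    c3: int = 0   # next level from which a store may be initiated
    c4: int = 0   # next level to be used by the arithmetic unit

    def level(self, count: int) -> int:
        """Level number addressed by a running count."""
        return count % self.n

    def i1(self) -> bool:
        """IAU blocked: an unexecuted store is still in level C1."""
        return self.c1 == self.c3 + self.n

    def i2(self) -> bool:
        """Store blocked: every loaded level has been passed by C3."""
        return self.c1 == self.c3

    def i3(self) -> bool:
        """Fetch blocked: every loaded level has been passed by C2."""
        return self.c1 == self.c2

    def i4(self) -> bool:
        """Arithmetic unit blocked: level C4 holds no new instruction."""
        return self.c1 == self.c4

    def i5(self) -> bool:
        """IAU blocked: the instruction in level C1 is not yet executed."""
        return self.c1 == self.c4 + self.n

    def iau_may_load(self) -> bool:
        """The IAU may load level C1 only if I1 and I5 do not hold."""
        return not (self.i1() or self.i5())
```

With N = 4 and all counters at zero, I2, I3 and I4 hold (nothing is
loaded) while the IAU is free to load. Once C1 has advanced by four with
no other counter moving, I1 and I5 hold and the IAU must wait.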



C. Logic of the Virtual Memory 

1. General 

There are two basic precepts which must be kept in mind to under- 
stand the operation of the virtual memory: 

(a) The OK bit (OKj) being set in the j th level indicates that
the contents of Dj is the correct data called for by DAj.
All operations will be performed only under this condition,
and logical decisions will be made in such a manner
as to make sure this is the case.

(b) Addresses can be compared by the IAU with every DA
address simultaneously. DAj is not used for any level which
does not have its Cj bit set. If a comparison exists between
a new DAj being placed in the virtual memory and an old
DAk, the compare bit Ck is turned off and the address
of level j is placed in FAk. This insures a unique meaning
for the comparison. If this were not done, the address DAi
of another instruction might compare against two levels
and thus cause an ambiguity.

2. Instruction Fetch Logic 

Figure 4 is a flow diagram of the IAU Instruction Fetch Procedure.
The logic is as follows: If the IAU is ready to fetch another
instruction, it compares the instruction address with all the DA's
of virtual memory. If there is no comparison, the instruction
fetch is initiated. If there is a comparison, the IAU must take
its instruction from the virtual memory provided the OK bit is
set; otherwise, it must wait until the OK bit is set.

Note: This procedure prevents the logical difficulty mentioned
earlier which would occur if the virtual memory contained a store
order into the instruction presently being fetched.

For example: 
FIGURE 4. Instruction fetch procedure.

The store to a+2 must be done in sequence or the old value N
would be used for the address instead of the quantity being set
by a.

3. Indexing Logic 

Figure 5 shows the flow for instruction indexing. After determining
that an instruction is ready to be indexed, the IAU tests whether
or not the index value is available. If it is, the indexing operation
is started. If it is not and a memory reference for the index value
has already been started, the IAU waits until the data returns before
proceeding. If the index fetch has not been started, the IAU compares
the index address against all the data addresses in virtual memory.
If none compare, the index value is fetched normally. If one does
compare, the index fetch is held up until the OK bit is set for the
data. This value from the virtual memory is then used for indexing
the instruction.
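
The decision order just described can be written as a small function.
This is a sketch only: the function name, its boolean arguments and the
returned action strings are hypothetical labels for the tests and
actions named in the text, and the function models one pass through the
decisions rather than any timing.

```python
def indexing_step(have_instruction, index_obtained, mem_ref_started,
                  compares_with_vm, ok_bit_set):
    """Return the action the IAU takes on one pass through the tests."""
    if not have_instruction:
        return "idle"                       # nothing to index
    if index_obtained:
        return "index instruction"          # index value is available
    if mem_ref_started:
        return "wait"                       # memory reference outstanding
    if not compares_with_vm:
        return "start memory reference"     # normal index fetch
    if ok_bit_set:
        return "obtain index from VM"       # use the value held in a level
    return "wait"                           # matching level not yet O.K.
```

For example, an index address that compares with a virtual memory level
whose OK bit is not yet set yields "wait", and the same call with the OK
bit set yields "obtain index from VM".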

4. Logic of Putting Instructions in the Virtual Memory 

(a) Figures 6, 6A, 6B, 6C represent the logical flow for putting
instructions into the virtual memory. If the indexing arithmetic
unit has an instruction prepared for the virtual memory,
it may transmit the instruction into the virtual memory
if interlocks one and five do not forbid it. These interlocks
prohibit a new instruction from destroying an old
one which has not been executed as yet, whether an arithmetic
operation (I5) or an unexecuted store (I1). The handling
of the instructions varies depending on whether they
are of the bring type, store type, or immediate type.

(b) The bring type, as described in Figure 6A, proceeds as
follows: If the effective data address of the instruction
compares with the DA address in some level, the instruction,
its op code, and effective data address are loaded
into the level marked by C1. The compare bit for level
C1 is set to one while the compare bit for the compared-with
level is set to zero. If the O.K. bit in this compared-with
level is set, meaning that the data located there is
correct, the data is transmitted directly to the C1 level
and its O.K. bit is also set. If the O.K. bit is not set, we
must tag the compared-with level by setting its forwarding
bit and by putting the value of C1 into its forwarding
address. The bring bit for level C1 is also set to zero
since no further data fetch is required.

If the effective data address does not compare with any virtual
memory level, the instruction is put directly into level C1,
its O.K. bit is set to zero, and its bring bit is set to one,
indicating that a fetch must be started.

FIGURE 5. Indexing procedure.

FIGURE 6. Procedure for placing instructions into the virtual memory.

FIGURE 6A. Logical conditions for bring type operations.

(c) Figure 6B shows the store type procedure. If the effective
address of the instruction does not compare with the DA
address in any level, the instruction is placed into the
level marked by C1. The store bit is set to one, indicating
that a store will be required. The level's bring bit and
forwarding bit are set to zero; its compare bit is set to one.
If on the other hand the addresses do compare, the same
procedure is followed; but in addition, the compare bit in
the level compared-with is set to zero so that future
comparisons will not use it.

The OK bit has not yet been set. It is set to one if the
operation is an index store and set to zero if it is an
ordinary store. For the ordinary store it is clear that
the OK bit should be zero since the data must come from
the arithmetic unit after the preceding instruction is
executed.

As was mentioned in the definitions previously, we treat all
indexing instructions as store type and place the new value
of the indexed quantity into the virtual memory. This is
done because the indexing arithmetic unit is going ahead
of the normal order of instruction execution and an interrupt
may occur before the indexing operation should
have been done. In this case, the old value of the index is
still in the index register. On the other hand the indexing
arithmetic unit compares with the virtual memory and
extracts the most recent value of the index for indexing
succeeding instructions. The OK bit is set to one since the
appropriate data is in the above level. Both the new and old
index values must be carried along to give logically correct
conditions in the case of an interrupt.

A situation very similar to interrupt occurs in branches on
arithmetic results where the indexing arithmetic unit
"guesses" which branch will be taken and proceeds with
fetching and processing the instructions on this branch,
subject to being wiped out if the guess proves to be wrong.
(See the discussion on "Wrong way Branches" below.)

FIGURE 6B. Logical conditions for store type operations.

FIGURE 6C. Logical conditions for immediate type operations.

(d) Immediate type instructions are the simplest type because
they essentially carry their data with them. Figure 6C
shows the logic in this case. The instruction is placed in
the virtual memory level marked by C1. The address field
of the instruction is placed in the data field of C1. The OK
bit is set to one indicating the data is present. The bring
and store bits are both set to zero. The compare bit is
set to zero since the DA address field has no meaning for
immediate type ops. (The data address of the last instruction
which occupied this level still remains in DA, so it has
no relation to the present D field.)
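
The three loading procedures of this section can be drawn together in a
short Python sketch. It is an illustration under stated assumptions, not
a description of the actual circuits: the class Level and the functions
find_match, place_bring, place_store and place_immediate are
hypothetical names, addresses and data are plain integers, the
instruction address and op code fields are omitted, and the interlock
tests I1 and I5 are assumed to have been passed before any of the
functions is called.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Level:
    """One virtual memory level (field names are illustrative)."""
    da: Optional[int] = None   # data address DA
    d: Optional[int] = None    # data field D
    ok: bool = False           # O.K. bit
    bring: bool = False        # bring bit
    store: bool = False        # store bit
    fwd: bool = False          # forwarding bit
    fa: Optional[int] = None   # forwarding address
    cmp: bool = False          # compare bit


def find_match(levels: List[Level], addr: int) -> Optional[int]:
    """Return the level whose DA compares with addr, if any."""
    for k, lv in enumerate(levels):
        if lv.cmp and lv.da == addr:
            return k
    return None


def place_bring(levels: List[Level], c1: int, addr: int) -> None:
    """Load a bring type instruction into level c1."""
    k = find_match(levels, addr)
    new = Level(da=addr, cmp=True)
    if k is None:
        new.bring = True                  # data must be fetched from memory
    else:
        old = levels[k]
        old.cmp = False                   # only the newest level is compared
        if old.ok:
            new.d, new.ok = old.d, True   # data transmitted directly
        else:
            old.fwd, old.fa = True, c1    # forward when the data arrives
    levels[c1] = new


def place_store(levels: List[Level], c1: int, addr: int,
                index_value: Optional[int] = None) -> None:
    """Load a store type instruction; index stores carry their value."""
    k = find_match(levels, addr)
    if k is not None:
        levels[k].cmp = False
    levels[c1] = Level(da=addr, d=index_value, ok=index_value is not None,
                       store=True, cmp=True)


def place_immediate(levels: List[Level], c1: int, address_field: int) -> None:
    """Load an immediate type instruction; the old DA is left in place."""
    stale_da = levels[c1].da
    levels[c1] = Level(da=stale_da, d=address_field, ok=True)
```

Placing two bring type instructions with the same address into levels 0
and 1 leaves level 0 with its compare bit off, its forwarding bit on and
its forwarding address equal to 1, while level 1 needs no fetch of its
own.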

5. Logic of Data Fetching (See Figure 7)

When an instruction of the bring type has been placed in the virtual
memory, the data required by the instruction in general will not
be present (unless a comparison exists as was described above)
and thus the data must be obtained from core storage. The fetch
cannot be started if interlock I3 holds, which means all the fetches
corresponding to the instructions presently in the virtual memory
have been started. If a fetch is possible, the bring bit at level C2
indicates whether or not a fetch is necessary. If necessary, the
fetch may be started if the memory bus and memory unit corresponding
to the data address are not already being used. When the fetch
is started, the bring bit for level C2 is set to zero. The counter
C2 is then stepped forward to the next level.
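
A minimal sketch of one fetch attempt follows. It makes several
assumptions that are not in the text: the counters are running counts
with the level taken mod N, each level is a dictionary with "bring" and
"da" entries, memory_free is a hypothetical function telling whether the
memory unit for an address is idle, and C2 is stepped past a level whose
bring bit is already zero.

```python
def fetch_step(levels, c1, c2, bus_free, memory_free):
    """Try to initiate one fetch; return the new value of C2."""
    if c1 == c2:
        return c2                      # interlock I3: nothing left to fetch
    lv = levels[c2 % len(levels)]
    if lv["bring"]:
        if not (bus_free and memory_free(lv["da"])):
            return c2                  # bus or memory busy; try again later
        lv["bring"] = False            # fetch started
    return c2 + 1                      # step C2 to the next level
```

Called repeatedly, once per time step, the function advances C2 until it
catches up with C1, stalling only while the bus or the addressed memory
unit is in use.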

6. Logic of Data Storing 

Figure 8 shows the Data Store Logic, which is very similar to that
for data fetching just described. The only significant difference
is that the O.K. bit must be set before the operation can be started.

7. Logic for Placing Data into the Virtual Memory 

In Figure 9, we see the logical conditions which must be satisfied 
by the virtual memory. The return address which was supplied 
when the fetch was started selects the level into which the data 
will be placed. The O. K. bit is then set to one, indicating that 
the proper data is in the level. The operation is complete at this 
point unless the forwarding bit is set. In this case, the data must 
be forwarded to the level designated by the forwarding address. 
This procedure continues from level to level as long as the data 
continues to arrive into a level whose forwarding bit is set. This 
procedure automatically supplies all operands present having 
identical data addresses with the proper data, without additional 
memory references. 
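
The forwarding chain can be sketched as a simple loop. The dictionary
keys and the function name deliver are hypothetical, and clearing the
forwarding bit once the data has been passed on is an assumption made
here so that the loop always terminates.

```python
def deliver(levels, return_level, data):
    """Place returned data in a level and follow the forwarding chain.

    Returns the list of levels that received the data, in order.
    """
    visited = []
    j = return_level
    while True:
        lv = levels[j]
        lv["d"], lv["ok"] = data, True   # proper data is now in the level
        visited.append(j)
        if not lv["fwd"]:
            return visited               # operation complete
        lv["fwd"] = False                # assumption: clear once forwarded
        j = lv["fa"]                     # pass the data on to the next level
```

If level 0 forwards to level 2 and level 2 forwards to level 3, a single
memory reference returning to level 0 supplies all three levels.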

FIGURE 7. Data fetch procedure.

FIGURE 8. Data store procedure.

FIGURE 9. Procedure for placing data into virtual memory.

8. Logic of Removing Instructions from the Virtual Memory

Observing Figure 10, we notice that as the arithmetic unit completes
an instruction it checks to see if the next instruction in the virtual
memory is ready to be executed (indicated by interlock I4).

Note: The operation may be an unconditional branch, a conditional
branch, or an index type store, as well as a normal bring or store
type instruction involving the accumulator. Figure 10 shows only
the cases which involve the universal accumulator. Instructions
such as the unconditional branches are merely ignored at this
point. They are carried along only to provide the data for recovery
in the event an interrupt occurs. The execution of the conditional
branches on arithmetic results is described in the next section.

If the next instruction marked by counter C4 is ready, it is fed into
the arithmetic unit. If it is a store type, the data is gated from
the accumulator into the data field of level C4, and the OK bit is set
to one. If the forwarding bit of the level is set, the data is also
forwarded; a forwarding procedure in this case is essential for the
proper logical operation of the computer, whereas in the bring case it
is a time-saver only.

If the instruction is not a store type, the arithmetic unit must hold up
until the O.K. bit for the level is set. When the O.K. bit is set, the
instruction is gated into the arithmetic unit and executed.

9. Logic of Interrupt Procedure 

If for any cause an interrupt (or trap) from a special condition
occurs, the instruction which is being executed in the arithmetic
unit is completed. However, the next instruction is not executed in
spite of the fact that all the data preparation for it may have been
completed. The address in the IA (instruction address) field will serve
as the value to reset the instruction counter if it is desired. The
Virtual Memory is initialized, i.e., set to the starting conditions
of an interrupt, with the exception that all store orders which have
already received data from the accumulators must be executed first.

Note: If the interrupt is of such a nature that the normal flow of
instructions is not resumed, the procedure of storing the modified
values of the index registers in the Virtual Memory gives logically
correct results, i.e., the same as if the interrupt had occurred
before the indexing took place.

FIGURE 10. Procedure for removing instructions from virtual memory.

IV. DESCRIPTION OF TIMING SIMULATION PROGRAM

A. General Considerations 

During the logical design of Stretch it was necessary to prove the value
of the virtual memory concept and to assist in the selection of optimum
values of various system design parameters. Examples of such
parameters are: the number of memory boxes, interlace and allocation
of memory addresses, and numbers of virtual memory levels. Also
of interest were trade-off factors for speeds of indexing arithmetic unit,
arithmetic unit, memories, etc.

In November 1957 the Timing Simulator (SIM-2) described here was
written for the IBM 704. This program attempted to answer such
questions quantitatively by simulating the time-wise operation of
Stretch on typical test programs coded in Stretch language.

The basic logic of the 704 program follows the principles just described
in the preceding section for the virtual memory. It should be stressed
that the simulator is a timing simulator and does not execute the
instructions in an arithmetic sense. It traces the time-wise progress of
the instructions through the components of the computer, observing all
the interlocks and time delays necessary for correct representation of
the behavior of the machine.

One of the fundamental concepts in the Stretch design is that of
asynchronous operation of the components. This means that there are a
large number of logical steps being executed at any one time in the
computer, each of them proceeding at its own rate. To simulate this
flow of many parallel continuous operations, we have broken the
continuous time variable into finite time steps. The basic time step
is taken as 0.1 microsecond in the simulator.

By taking 0.1 microsecond as our quantum of time, we are automatically
setting the scale of the smallest circuit entities which we will consider
as being those which accomplish complete functions in 0.1 microsecond
or a few multiples thereof. Thus, by using this philosophy, and considering
many of the components of the computer as "black boxes", we greatly
simplify the details which must be considered without introducing serious
timing inaccuracies.

Our experience has indicated that more information was gained by making
a large number of fast parameter studies using different configurations
and programs than could have been obtained by a very slow, detailed
simulation of a few runs with more precision per run. Even so, our
time scale is too fine to make serious input-output application studies.
These would require a simpler simulator having at least a factor of 10
coarser basic time interval.

B. Logic of the Simulator

In the asynchronous organization of Stretch there can be many major
components operating at any one time. To achieve this parallel effect
in the simulator we essentially "hold time still" and scan the entire
machine representation at each time step. Although every major block
of the program is traversed at each time step, if there is no activity
required in a given block, only a few tests need be made by the code.

If in this process it is determined that a given logical unit should do an
operation, the time interval required for the operation is obtained from a
table of constants. The speed of the various logical units can thus be
changed parametrically by changing the values in the tables. A constant
obtained from the tables is inserted into a memory location called the
time counter for that unit. At each time step the program reduces this
counter by one until it reaches zero. Thus, the fact that the counter is
non-zero can be used to indicate that the particular logical unit is busy
and not available to service other requests. When the counter is zero
the unit can consider a new input.
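
The time-counter technique can be illustrated with a short Python
sketch. It is not the SIM-2 code: the function name simulate, the
dictionary arguments and the example unit names are hypothetical, and
each unit is reduced to a queue of identical pending operations. The
durations table plays the part of the table of constants, measured in
time steps.

```python
def simulate(pending, durations, steps):
    """Scan every unit at each time step and count operations started.

    pending   -- unit name -> number of operations waiting for that unit
    durations -- unit name -> operation time in time steps (the "table")
    steps     -- number of time steps to simulate
    """
    pending = dict(pending)                  # do not modify the caller's copy
    counters = {unit: 0 for unit in durations}
    started = {unit: 0 for unit in durations}
    for _ in range(steps):
        # Scan the whole machine while "holding time still".
        for unit in counters:
            if counters[unit] == 0 and pending.get(unit, 0) > 0:
                pending[unit] -= 1
                counters[unit] = durations[unit]   # unit is now busy
                started[unit] += 1
        # Count down: reduce every non-zero time counter by one.
        for unit in counters:
            if counters[unit] > 0:
                counters[unit] -= 1
    return started
```

Changing an entry in durations changes the speed of that unit without
touching the scanning logic, which is the sense in which the speeds can
be varied parametrically.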

In addition to the time counters many of the logical blocks contain other
conditions or interlocks which affect the operation of the block. These
conditions are stored in the program and tested before action is
undertaken.

It is interesting to note that since the simulator simulates timing only,
not the arithmetic or indexing functions, the sequence of instructions to
be executed must be furnished as a "string" with all loops unwound.
However, to make the computer behave as it actually would, the loops must
be furnished with "wrong way" paths given for the cases where the
computer would take such paths. Also one must furnish more than
enough information along such paths since it is difficult to predict in
advance how far the computer will get down the wrong path before it is
called back.

Parameters are changed from one run to another by use of control cards.
The control cards are set up in such a way that any number of parameters
may be changed between runs. Results are given either as detailed
timing charts or as summary listings for each problem. The usual
procedure has been to print only summary results while making a series of
parameter studies. The detailed timing charts as printed on the 704
for most problems would be about 50 feet long for each run. Since over
1000 cases have been run, it is clear that only a few cases could be printed
in full detail. These are particularly useful in seeking the causes of
conflicts which slow the computer.

C. Results of Parameter Studies

When the simulator program was completed, we undertook a series of 
studies in which the main parameters describing the Stretch system were 
varied one or two at a time in order to get a measure for the importance 
of different effects. After this we began to specialize the studies towards 
answering specific questions in the Stretch design. 

The simplified flow diagram in Figure 11 indicates the order in which
the subroutines for the various logical units are executed at each time
step. Using the types of techniques just described above, the logical
subroutines simulate the action of the components of the computer such
as the virtual memory, arithmetic unit, etc.



V. SOME RESULTS OF THE SIMULATION STUDIES 

Figure 12 shows examples of the type of output listings given by the
simulator. Figure 12 is a piece of a long timing chart with each line of
printing representing 0.1 microsecond of time. The columns represent
the various components of the computer. On the left and right are timing
counts subdividing each microsecond. On the far right are conflict
indicators ("C" on the charts) and waiting indicators, "W", which indicate
when interlocks prevent operations from proceeding.

The 2nd column, II, gives the number of the instruction being indexed.
The 4th column, AU, gives the number of the instruction using the
arithmetic unit. The next four columns represent the instructions using
the memory buses. The columns labeled X-, F-, and M- represent the
index, fast, and main memories. A string of "X's" in the columns
represents the cycle time of the memory. The number indicates the
instruction using the memory and the number of times which it is
repeated gives the readout time of the memory. The columns L- indicate
which instruction is located in the virtual memory levels. The other
columns are for details in analysis and need not be considered here.

Five of the test problems used most frequently are described below. 
Other test problems were used for specific studies, but since the results 
were similar for all problems of a given type, we gradually discontinued 
using them. The following were originally selected as being typical of 
different classes of problems. 

1. Mesh Problem - Part of a hydrodynamics problem from Los Alamos. 
It contains a more or less "average" mixture of instructions for 
scientific problems: 85% floating point instructions, 14% index 
modification instructions, and 1% VFL. It is usually arithmetic 
unit limited. 

     1  INITIALIZATION 
     2  ARITHMETIC UNIT 
     3  DECODE OPERATIONS 
     4  VIRTUAL MEMORY 
     5  INDEXING ARITHMETIC UNIT 
     6  BUS FROM MEMORY 
     7  BUS TO MEMORY 
     8  I/O REFERENCES TO MEMORY 
     9  V.M. STORE REFERENCES TO MEMORY 
    10  V.M. FETCH REFERENCES TO MEMORY 
    11  I.A.U. REFERENCES TO MEMORY 
    12  INSTRUCTION FETCH REFERENCES TO MEMORY 
    13  COUNT-DOWN TIME 
    14  PRINT DETAILED LISTING 
    15  SUMMARIZE AND PRINT 

FIGURE 11. SIM-2 simplified flow diagram. 

PROJECT 7000 SIMULATOR 2 COCKE & KOLSKY NOV 67 

FIGURE 12. Listing of simulator print-out. 

2. Monte Carlo Branching Problem - Part of an actual Monte Carlo 
neutron diffusion code. It represents a chain of logical decisions 
with very little arithmetic in between. It contains 47% floating point, 
15% index modification instructions, and 36% branches of the indicator 
and unconditional types. It is largely instruction-access limited. 

3. Reactor Problem - The inner loop of a neutron diffusion problem. 
It consists of 90% floating point arithmetic (39% of which are 
multiplies) and 10% index modification instructions. It is almost 
entirely arithmetic unit limited. 

4. Computer Test Problem - The evaluation of a polynomial using 
computed indices. It has 71% floating point, 10% index modifica- 
tion, 6% VFL and 13% indicator branches. It is usually arithmetic 
unit limited, but not for all configurations. 

5. Simultaneous Equations - The inner loop of a matrix inversion 
routine: 67% floating point and 33% index modification. Arithmetic 
and logic are about equally important. It is limited both by 
arithmetic and instruction-access speeds. 



A. Speed vs Number of Levels of Virtual Memory 

Figure 13 shows the effect on computer performance of varying the number 
of levels of virtual memory. Curves for the Monte Carlo and Mesh 
calculations with two sets of arithmetic and indexing arithmetic speeds 
are shown. The AU times given are averages for all operations. A 
number of interesting results are apparent from these curves: 

1. There is a tremendous gain to be had in going to the virtual memory 
organization. The point for "0 levels" means that the arithmetic 
unit is tied directly to the instruction preparation unit, although 
simple indexing-execution overlap is still possible. 

2. The gain in performance goes up very rapidly for the first two 
levels then rises more slowly for the rest of the range. 

3. A large number of levels benefits the Monte Carlo problem less 
than the Mesh problem because constant branching in the former 
spoils the flow of instructions. Notice that the curve for the 
Monte Carlo problem actually decreases slightly beyond six levels. 
This phenomenon is a result of memory conflicts caused by extraneous 
memory references started by the computer running ahead on the 
wrong-way paths of branches. 

FIGURE 13. Computer speed vs. number of levels of look-ahead registers: 
4 main memories, 2.0 μsec; 2 fast memories, 0.6 μsec; for 
two sets of arithmetic speeds. 

4. The computer performance on a given problem is clearly less for 
slower arithmetic speeds. However, it is important to note that 
the sensitivity of the performance is also less for slower arithmetic 
speeds. The virtual memory improves the performance in either case, 
but it is not a substitute for a fast arithmetic unit. 



B. Speed vs Number of Main Memory Units 

Figure 14 shows how internal computer performance varies with the 
total number of memory units for a particular problem. The entire 
calculation is assumed to be contained in memory for all cases. The 
speed gain from overlapping memories is quite apparent from the graphs. 
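
The gain from overlapping memory units can be illustrated with a very 
rough sketch. It assumes, for illustration only, that consecutive 
references fall in consecutive memory units, that a new reference can be 
issued every `issue_usec` microseconds, and that a unit is tied up for 
one full cycle per reference. The function name and the numbers used are 
hypothetical and are not taken from the simulator. 

```python
def time_for_references(n_refs, n_units, cycle_usec, issue_usec):
    """Time at which the last memory cycle ends, for a run of references
    spread round-robin over n_units memory units."""
    free_at = [0.0] * n_units        # when each unit next becomes free
    now = 0.0                        # earliest time the next reference can issue
    for i in range(n_refs):
        unit = i % n_units           # consecutive references hit consecutive units
        start = max(now, free_at[unit])   # wait if that unit is still cycling
        free_at[unit] = start + cycle_usec
        now = start + issue_usec
    return max(free_at)


# Eight references, 2.0 usec cycle, one reference issued every 0.5 usec (assumed).
for n_units in (1, 2, 4):
    print(n_units, time_for_references(8, n_units, 2.0, 0.5))
```

With a single unit every reference waits for the previous cycle to end; 
with four units the cycles overlap and the same references finish in 
roughly a third of the time under these assumptions. 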

The speed differential between having and not having instructions separated 
from data arises from delays in instruction fetches caused by the memory 
units being busy with data. The size of this effect varies from problem 
to problem, being less pronounced for problems which are arithmetic 
limited and more for logical problems. 

The "X's" on the graph show the effect of replacing the 0.6 μsec 
instruction memories by a pair of 2.0 μsec memories. The resulting 
performance change is small for the Mesh problem, which is arithmetic 
limited, but large for the instruction-fetch limited Monte Carlo problem. 



C. Speed vs Arithmetic Unit and Indexing Arithmetic Unit Times 

Although everyone realizes the importance of arithmetic speed on overall 
computer performance, it was not until the simulator results became 
available that the true importance of the indexing arithmetic speeds 
was recognized. Figures 15 and 16 show a two parameter family of 
curves giving the computer speed as a function of the AU and IAU times. 

Figure 16, in which the arithmetic time is the abscissa, shows an 
interesting "saturation" effect where the computer performance is 
independent of AU speed below some critical value. Thus it makes no sense 
to strain AU speeds if the IAU is not improved to match. The curves in 
Figure 15 show the same effect, i.e., the IAU speed serves as a 
"ceiling" on performance beyond which the AU speed cannot pass. 
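
The saturation effect can be mimicked by a deliberately crude model, given 
here only as a sketch. It assumes that every instruction passes through 
both the IAU and the AU, that the two units overlap perfectly, and that 
memory is never the limit, so that the slower unit sets the pace. The 
function name and the times used are hypothetical; the curves in 
Figures 15 and 16 come from the full simulator, not from this formula. 

```python
def instructions_per_usec(au_usec, iau_usec):
    """Steady-state rate when the slower of the two units sets the pace."""
    return 1.0 / max(au_usec, iau_usec)


iau_usec = 1.0  # assumed, fixed indexing arithmetic time
rates = [instructions_per_usec(au, iau_usec) for au in (4.0, 2.0, 1.0, 0.5, 0.25)]
print(rates)
```

In this model the rate stops improving once the AU time falls below the 
IAU time, which is the "ceiling" behavior described above. 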



D. Arithmetic Unit Efficiency 

One fallacy which is frequently quoted is that the goal of improved 
computer organization is to increase the arithmetic unit efficiency. 
Actually there are two reasons why this is not the goal in itself. The 
first is that arithmetic efficiency depends strongly on the mixture of 
arithmetic and logic in a given problem, so that a general purpose 
computer cannot hope to give equally high percentage utility to all. The 
second reason is that the simplest way to increase the arithmetic unit 
efficiency in any asynchronous case is to slow down the arithmetic unit! 
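
The second point can be made concrete with a small sketch under stated 
assumptions. Suppose each instruction needs `au_usec` of arithmetic unit 
time and `other_usec` of everything else (indexing, fetching), and that 
the two overlap perfectly, so the time per instruction is the larger of 
the two. Efficiency is then the fraction of time the arithmetic unit is 
busy. Both names and all numbers are hypothetical. 

```python
def au_efficiency_and_rate(au_usec, other_usec):
    """Return (AU efficiency, instructions per microsecond) for a model in
    which AU work and all other work overlap perfectly."""
    per_instruction = max(au_usec, other_usec)
    efficiency = au_usec / per_instruction   # fraction of time the AU is busy
    rate = 1.0 / per_instruction             # overall computer performance
    return efficiency, rate


fast = au_efficiency_and_rate(0.5, 1.0)  # fast arithmetic unit
slow = au_efficiency_and_rate(2.0, 1.0)  # arithmetic unit slowed by a factor of 4
print(fast, slow)
```

Slowing the arithmetic unit raises its efficiency from 50% to 100% in 
this model while halving the overall rate, which is why efficiency alone 
is a poor design goal. 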

The real goal of improved organization is maximum overall computer 
performance for minimum cost. One will tend to increase the arithmetic 
unit speed as long as its percent efficiency is reasonable for a variety 
of problems. One will stop this process when the overall performance 
gain no longer matches the increase in hardware and complexity. Thus 
the arithmetic unit efficiency is a by-product of this design process, not 
the prime variable. 



E. Speed vs. Concurrent Input-Output Activity 

Because of the relative time scales of I/O activity and the CPU 
processing speeds, the simulator cannot take into account the effect of 
the availability or non-availability of data from I/O on the program 
being run. However, we can observe the effect on the computation of the 
I/O devices operating at different rates simultaneously with computing. 

Using the Stretch control word philosophy, it is possible to have a 
number of input-output units operating at the same time the Central 
Processing Unit is running. The Basic Exchange can reach a peak rate of 1 
word every 10 microseconds. The high speed disk normally operates at 
1 word every 4 microseconds. Since the mechanical devices take priority 
over the CPU in addressing memory, the computation slows down because 
of memory-busy conflicts. 

Figure 17 shows an example of how internal computing speed is slowed as 
the I/O word rates are varied continuously. At the theoretical "choke off" 
the I/O devices take all the memory cycles available and stop the calcula- 
tion. Notice that this condition can never arise for any I/O rates presently 
attainable. 
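
A back-of-the-envelope sketch shows why the choke-off point is so far 
from the attainable I/O rates. It assumes, for illustration only, four 
main memory units with a 2.0 microsecond cycle (the configuration quoted 
for Figure 13), one word transferred per memory cycle, and I/O references 
spread evenly over the units. The function name is hypothetical, and the 
sketch ignores the detailed conflict timing that the simulator tracks. 

```python
def memory_share_taken_by_io(word_periods_usec, n_units, cycle_usec):
    """Fraction of all available memory cycles consumed by I/O references.

    word_periods_usec: microseconds per word for each active I/O device.
    """
    io_words_per_usec = sum(1.0 / period for period in word_periods_usec)
    memory_words_per_usec = n_units / cycle_usec
    return io_words_per_usec / memory_words_per_usec


# Basic Exchange at 1 word per 10 usec plus the disk at 1 word per 4 usec.
share = memory_share_taken_by_io([10.0, 4.0], n_units=4, cycle_usec=2.0)
print(f"{share:.1%} of memory cycles taken by I/O")

# Choke-off in this model: a single stream of one word every 0.5 usec.
choke = memory_share_taken_by_io([0.5], n_units=4, cycle_usec=2.0)
```

Under these assumptions the two devices together take well under a 
quarter of the memory cycles, while choke-off would need a word rate 
several times higher than both combined. 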

A Stretch system with only 1 or 2 memory units has lower performance than 
a larger one for three reasons: (1) the top speed of the system is 
reduced by the loss of memory overlap, (2) it has a larger I/O penalty 
when I/O is run concurrently with the computation, and (3) the smaller 
amount of data which can be held in the memory at one time increases 
the amount of I/O activity needed to do the job. Note, however, that 
increasing the memory size on a computer of conventional organization 
only improves the third area. 

FIGURE 17. Internal computing speed. Percentage reduction in speed 
caused by input-output devices referencing memory at different 
rates while the calculation is proceeding. 

F. A Study of Branching on Arithmetic Results in Stretch 

One penalty of the non-sequential preparation and execution of 
instructions used in Stretch is that if there is a branch in the problem 
code it spoils the smooth flow of instructions to the indexing arithmetic 
unit. Any branch in a program will cause some delay, but the most serious 
ones are the branches on arithmetic results which cannot be detected 
by the indexing arithmetic unit in advance. 

There are two fundamental ways in which branches on arithmetic unit 
results can be handled by the computer. 

1. The computer can stop the flow of instructions until the arithmetic 
unit has completed the preceding operation so that the result is 
known, then fetch the next correct instruction. This places a 
delay on every AU result branch whether taken or not. 

2. The computer can "guess" which way the branch is going to go before 
it is taken and proceed with fetching and preparing the instructions 
along one path with the understanding that if the guess was wrong, 
these instructions must be discarded and the correct path taken 
instead. 

A detailed series of simulator runs was made to study this situation and 
to decide which way Stretch should be designed. Some of the general 
observations were: 

1. The performance in a problem with considerable arithmetic data 
branching can vary by approximately ± 15% depending on the way in 
which the branches are handled. 

2. Holding-up on every branch seems to be less desirable than any 
of the guessing procedures. Some time is lost whenever a branch 
is executed rather than proceeding to the next instruction. Unless 
there is an unusual situation in which there is a very large 
probability that the branch will always be taken, the least time will be 
lost if one assumes that the branch is not taken. 

3. The theoretically highest performance would be obtained if each 
branch had an extra "guess bit" which would permit the programmer 
to specify which way he estimates each branch will most likely go. 
However, this would place a considerable extra burden on the 
programmer for the gains promised. (It also uses up many valuable 
OP codes.) 

4. It is realized that there is a "feedback" in such decisions because 
the way in which the machine guesses the branches will influence 
future programmers to write their codes to take advantage of the 
speed gain. The result is that the statistics of the future will be 
biased in favor of the system chosen for the machine, and thus 
"prove" that it was the right decision! 
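
The trade-off among the branch handling schemes can be sketched as an 
expected-delay calculation. The cost model below is an assumption made 
for illustration and is not the simulator's: `hold` is the wait for the 
arithmetic result, `refetch` is the time lost in fetching from the branch 
target, and `discard` is the time lost in throwing away wrongly prepared 
instructions. All three are set to one arbitrary unit, and the policy 
names are hypothetical. 

```python
def expected_delay(policy, p_taken, hold=1.0, refetch=1.0, discard=1.0):
    """Expected time lost per AU-result branch under a given policy.

    p_taken is the probability that the branch is actually taken.
    """
    if policy == "hold":
        # Always wait for the result; refetch only if the branch is taken.
        return hold + p_taken * refetch
    if policy == "guess_not_taken":
        # Lose time only when the branch is taken: discard, then refetch.
        return p_taken * (discard + refetch)
    if policy == "guess_taken":
        # Always fetch from the target; discard when the branch falls through.
        return refetch + (1.0 - p_taken) * discard
    raise ValueError(f"unknown policy: {policy}")


for p in (0.3, 0.9):
    costs = {name: expected_delay(name, p)
             for name in ("hold", "guess_not_taken", "guess_taken")}
    print(p, costs)
```

With these assumed costs, guessing "not taken" loses the least time 
unless the branch is taken most of the time, which agrees in spirit with 
the second observation above. A per-branch "guess bit" would amount to 
choosing the cheaper of the two guessing policies for each branch. 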